Obesity Level Dataset

The source of the data is the paper Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico by Fabio Mendoza Palechor and Alexis de la Hoz Manotas from the Universidad de la Costa, CUC, in Colombia, published in an Elsevier journal and obtained from the UCI website. The paper presents data for estimating the obesity level of individuals from Mexico, Peru and Colombia based on their eating habits and physical condition. The data contains 17 attributes and 2111 records. The records are labeled with the class variable NObesity (Obesity Level), which allows classification of the data into the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. To balance the dataset, 77% of the data was generated synthetically using the Weka tool with the SMOTE filter; the remaining 23% was collected directly from users through a web platform.

Attributes related to eating habits

The attributes related to physical condition are:

Other variables obtained were:

Descriptive analysis

Chart 1:

Chart 2:

Chart 3:

Chart 4:

Based on this brief overview of the data, the categories of obesity levels were unbalanced in the original dataset. This posed a learning problem for the data mining methods, since a model would learn to identify the category with the most records more reliably than the categories with less data. After this class-balancing problem was identified, synthetic data was generated, up to 77% of the data, using the Weka tool and the SMOTE filter.

Label Encoding

Rename the column names on data_total

After converting the columns with float numbers into categorical variables, we use box plots to detect outliers and to see how those variables relate to age and gender. The Gender value 0 represents Female and the value 1 represents Male.
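As a minimal sketch of this renaming and label-encoding step (the column names, the shortened name Family_hist and the exact mappings are hypothetical; the real notebook may differ):

```python
import pandas as pd

# Hypothetical sample standing in for data_total; real columns may differ.
data_total = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "family_history_with_overweight": ["yes", "no", "yes"],
})

# Rename long column names to shorter ones (illustrative mapping).
data_total = data_total.rename(columns={"family_history_with_overweight": "Family_hist"})

# Label-encode binary categories: Female -> 0, Male -> 1; no -> 0, yes -> 1.
data_total["Gender"] = data_total["Gender"].map({"Female": 0, "Male": 1})
data_total["Family_hist"] = data_total["Family_hist"].map({"no": 0, "yes": 1})
```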

Correlation & Multicollinearity

Correlation matrix

EXPLAIN:

Later we will not include Weight and Height in our model, since these two variables form the BMI score on which we based the obesity level categories. We therefore eliminate Weight and Height from our analysis.
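The BMI score behind the obesity categories is the standard formula, weight in kilograms divided by height in metres squared; a small sketch:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# Example: 70 kg at 1.75 m gives a BMI of about 22.9 (Normal Weight range).
print(round(bmi(70, 1.75), 1))
```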

Two variables, BMI and Obesity, are the targets of our predictive models, so we do not examine the correlation between them.

We notice some pairs of variables whose correlation coefficients are around ±0.3, ±0.4 or ±0.5, hinting at a mild linear relationship:

We will continue to examine with p-value and VIF later.

Otherwise, the risk of multicollinearity (one feature depending on another) is quite low.

Link to theory interpretation: https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/interpret-the-results/

EXPLAIN: there is an obvious linear relationship between BMI and Weight, and a somewhat weaker one between Height and Weight.
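A hedged illustration of building such a correlation matrix with pandas, on synthetic data mimicking the Height/Weight/BMI relationship (this is toy data, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height = rng.normal(1.7, 0.1, 500)                  # metres
weight = 25 * height ** 2 + rng.normal(0, 5, 500)   # kg, tied to height
df = pd.DataFrame({"Height": height, "Weight": weight, "BMI": weight / height ** 2})

# Pairwise Pearson correlations; values near ±1 signal a strong linear relationship.
corr = df.corr()
print(corr.round(2))
```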

Split data

VIF

In line with the correlation matrix, the VIF is high for Vege: Vege and Gender have a high correlation coefficient. However, since the correlation matrix shows low correlation between Gender and our targets (Obesity, BMI), we can consider eliminating Gender from our model.

The VIF is also high for Age: Age and Public_trans have a high correlation coefficient. However, since the correlation matrix shows low correlation between Public_trans and our targets (Obesity, BMI), we can consider eliminating Public_trans from our model.
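The VIF can be computed without extra dependencies by regressing each feature on the others; a sketch on toy data where the stand-in Age and Public_trans columns are deliberately correlated (the feature names are borrowed from the text for illustration):

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor: regress each feature on the others,
    VIF_j = 1 / (1 - R^2_j). Values above roughly 5-10 signal multicollinearity."""
    X = df.to_numpy(dtype=float)
    out = {}
    for j, col in enumerate(df.columns):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])  # intercept + rest
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ beta).var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

rng = np.random.default_rng(1)
age = rng.normal(30, 8, 300)
public_trans = 0.8 * age + rng.normal(0, 3, 300)  # strongly tied to Age
vege = rng.normal(2, 0.5, 300)                    # independent feature
vifs = vif(pd.DataFrame({"Age": age, "Public_trans": public_trans, "Vege": vege}))
print(vifs.round(2))
```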

Supervised learning - Regression models

Scaling and dataset split

There are not many outliers, nor does the dataset appear to have a normal distribution, so we choose neither the Robust scaler nor the Standard scaler, respectively, for the regression part of our project. The MinMax scaler will be used to scale the data for further processing.

Then we divide the scaled dataset into train and test data in the ratio of 80:20. This split is necessary for the later evaluation of the models' prediction accuracy.
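A minimal sketch of this scaling-and-splitting step on toy data with features on very different scales:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix standing in for the preprocessed obesity data.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4)) * [1, 10, 100, 1000]  # features on different scales
y = rng.normal(size=100)

X_scaled = MinMaxScaler().fit_transform(X)          # every feature mapped into [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)    # 80:20 split
print(X_train.shape, X_test.shape)
```

Scaling before the split, as sketched here, leaks a little information about the test range into the scaler; a stricter pipeline would fit the scaler on the training fold only.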

Model selection

We evaluate the models' performance in their default settings and clarify which of them perform the best on our dataset in general. Based on the results of the cross validation we select the best models to work on further and find the best performing algorithm and its setup for our dataset.

We pick the following algorithms for evaluation: linear regression, gradient descent, RANSAC, support vector machines, decision trees (with 3 different depths) and random forest.

Based on the cross-validation scores and standard deviations of the basic models, we decided to continue with the SVM, Decision tree and Random forest models for feature selection and parameter tuning, in order to find the optimal setup and select the best performing model.
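The default-settings comparison can be sketched on synthetic regression data; the model list mirrors the one above (a single tree depth is shown for brevity):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, RANSACRegressor, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=0)

models = {
    "Linear regression": LinearRegression(),
    "Gradient descent": SGDRegressor(random_state=0),
    "RANSAC": RANSACRegressor(random_state=0),
    "SVM": SVR(),
    "Decision tree (depth 5)": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random forest": RandomForestRegressor(random_state=0),
}
# 5-fold cross-validated R^2 for each model in its default setup.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```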

Feature selection

To improve the performance of the SVM model we reduce the number of features using the mutual information for regression function. This should be the most suitable option to capture the target dependency, as we are handling non-linear data and continuous target values. We selected only features with a dependency higher than 0.05, as we want to avoid keeping highly independent features which do not improve the predictions' accuracy and only slow down the algorithm.

We can see that there are a few (highly) independent features which we can remove in the following steps. Also, the data scaling did not actually have a major impact on the original dataset.
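A hedged sketch of the mutual-information filter with the 0.05 threshold, on toy data where two of four features carry signal (one of them non-linearly):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.1, 500)  # columns 2-3 are pure noise

# Mutual information captures non-linear dependency on a continuous target.
mi = mutual_info_regression(X, y, random_state=0)
keep = mi > 0.05          # same threshold as in the text
print(mi.round(3), keep)
```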

Parameter tuning

In this section we tune the parameters for the earlier selected SVM, decision tree and random forest algorithms to find the best model for our data.

We use the GridSearch function to find the best parameter combination and compare the best score on the training data with the score on the test data to confirm the model is neither underfitting nor overfitting. We also use the MSE to improve our understanding of the models' performance and to compare the results across all evaluated models.

SVM

For the SVM model we selected various options of both the C and gamma parameters to find their best performing combination.

The SVM tuning gave us the combination of C=10 and gamma=10 as the most optimal. The accuracy of the model on the training and test sets is 76% and 80%, respectively. The MSE of the SVM model is 12.41.
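A sketch of such a grid search over C and gamma on synthetic data (the grid values are illustrative; the real search space may differ):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validated search over all C/gamma combinations.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVR(), param_grid, cv=5)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print(grid.best_params_, mean_squared_error(y_test, pred))
```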

Decision Tree

Although we expect the random forest to perform better, the decision tree model will help us understand the importance of the various features. It also performed better than the SVM model with the default setup, so we can verify whether this remains the case after parameter tuning.

Decision Tree model's features importance

The Decision tree tuning gave us the combination of max_depth=11 and max_features=15 as the most optimal, close to the maximum values we allowed. The accuracy of the model on the training and test sets is 74% and 70%, respectively. The MSE of the Decision tree model is 18.50. We can conclude that the model already performs worse than the SVM model while also slightly overfitting on the training set. Therefore there is no need to increase the maximum depth parameter further, which would most likely improve the training accuracy but increase the overfitting at the same time.

From the feature importances we see that, for the Decision tree algorithm as well, Family history and Age are among the most relevant features in the prediction process.
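Feature importances can be read straight off a fitted tree; a toy sketch (the dataset and feature count are synthetic stand-ins):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, n_informative=2,
                       noise=5, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Importances sum to 1; a higher value means the feature drives more splits.
order = np.argsort(tree.feature_importances_)[::-1]
print(order, tree.feature_importances_.round(3))
```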

Random Forest

Random forest was the best performing model in the default settings selection. In this part we try several options for its parameters n_estimators, max_depth and max_features using the GridSearch function again to see which combination gives us the best performance.

The Random forest model tuning gave us the combination of n_estimators=400, max_depth=17 and max_features=5 as the most optimal. The accuracy of the model on the training and test sets is 86% and 86%, respectively. The MSE of the Random forest model is 8.80. The Random forest significantly outperformed all other models while avoiding any overfitting, and proved to be the best solution for the regression part of our project.
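A sketch reproducing this final setup on synthetic data (n_estimators=400, max_depth=17 and max_features=5 as found by the grid search; the printed scores are for the toy data, not our dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Parameter values taken from the grid-search result described in the text.
rf = RandomForestRegressor(n_estimators=400, max_depth=17, max_features=5,
                           random_state=0)
rf.fit(X_train, y_train)
# Comparing train vs test R^2 checks for overfitting.
print(round(rf.score(X_train, y_train), 3), round(rf.score(X_test, y_test), 3))
```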

Supervised learning - Classification models

Model Selection

The models that give a high score are KNeighbors, DecisionTree, RandomForest and GradientBoosting. In the section below, we use these four models to analyze the classification of the obesity dataset.

K-neighbors Classifier


k = 4 will perform the best for this model

The KNeighborsClassifier performs best with k=4, reaching 75% accuracy, an improvement over the initial k=5 (68% accuracy).
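The search for the best k can be sketched with cross-validation on toy classification data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)

# Try several k values and keep the one with the best cross-validated accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 11)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```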

Decision Tree

Random Forest

In comparison to the feature importances of the Decision Tree:

Gradient Boosting


Semi-supervised learning

Limited label data

With only 100 labels, the score is low at 52%.

Representative sample

In this part, a representative sample needs to be found using KMeans. We used the inertia score to find the best k.
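A sketch of the inertia-based (elbow) search for k on toy blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) for a range of k values.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 9)}
# Inertia always decreases with k; look for the "elbow" where the drop flattens.
print({k: round(v, 1) for k, v in inertias.items()})
```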

Give labels to other instances in the same cluster

Propagating the label to the 20% of points closest to the centroid

The score on the test set improved when propagating the labels to the closest 20%. However, the model and the choice of k still need to be reconsidered carefully, as the score remains low.
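The whole representative-sample-and-propagation idea can be sketched on toy blobs (4 clusters instead of 7 for brevity; in practice the representative points would be hand-labeled rather than read from y_train as here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=500, centers=4, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
# Distance of every training point to its own cluster centroid.
d_own = km.transform(X_train)[np.arange(len(X_train)), km.labels_]

y_prop = np.full(len(X_train), -1)
for c in range(k):
    in_c = np.where(km.labels_ == c)[0]
    rep = in_c[d_own[in_c].argmin()]          # representative: closest to the centroid
    cutoff = np.percentile(d_own[in_c], 20)   # propagate to the closest 20% of the cluster
    chosen = in_c[d_own[in_c] <= cutoff]
    y_prop[chosen] = y_train[rep]             # stand-in for the hand-assigned label

mask = y_prop >= 0
clf = LogisticRegression().fit(X_train[mask], y_prop[mask])
print(round(clf.score(X_test, y_test), 3))
```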

Unsupervised learning

In this part, we aim to classify the data into roughly 7 balanced groups, since the original dataset has a balanced target variable.

Feature selection

Classify all categorical variables --> Y_cls

Variance threshold for unsupervised learning

MinMaxScaler

PCA

Detect outliers

As discussed in the EDA part, we do not have many outliers in our data. That is why the graph at the end shows only a few points with a high MSE. We removed 31 outlier points.
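One common way to get such a per-point MSE is PCA reconstruction error (this assumes the notebook's MSE is reconstruction-based); a sketch on toy data with three planted outliers:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
W = rng.normal(size=(2, 5))
X = latent @ W + rng.normal(0, 0.05, (300, 5))   # data lives near a 2-D plane
X[:3] += rng.normal(0, 5, (3, 5))                # plant three points far off the plane

pca = PCA(n_components=2).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))
mse = ((X - X_rec) ** 2).mean(axis=1)            # per-point reconstruction error

outliers = mse > np.percentile(mse, 99)          # flag the worst 1% as outliers
print(np.where(outliers)[0])
```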

TSNE

It is better to use all features with scaled data. The clustering methods can be Gaussian Mixture or K-Means. Since the colors overlap, clustering is likely to be very challenging.
This plot shows how the colors match the real label (obesity) classes. It reveals some potential for clustering/classifying Obesity Type III; overall, however, the potential for clustering is poor.
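A minimal sketch of the TSNE projection on scaled toy data; in the notebook the resulting 2-D points would be colored by the obesity class:

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Toy high-dimensional data with 7 groups, standing in for the obesity features.
X, y = make_blobs(n_samples=300, centers=7, n_features=8, random_state=0)

# Project the scaled features to 2-D for visualization.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    MinMaxScaler().fit_transform(X))
print(X_2d.shape)   # each row is a 2-D point we can color by the class label
```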

Gaussian Mixture

all features

Covariance type "full" performs best compared with the other covariance types. 7 clusters is the best choice.

all features + minmax scale

Covariance type "diag" performs best compared with the other covariance types. 7 clusters is the best choice.
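One way to make the "best covariance type / best cluster count" choice explicit is to compare BIC values; a sketch on toy blob data (the selected values on the real data may differ from the toy result):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import MinMaxScaler

X, _ = make_blobs(n_samples=500, centers=7, random_state=0)
X = MinMaxScaler().fit_transform(X)

# Try each covariance type and component count; keep the lowest BIC.
best = min(
    (GaussianMixture(n_components=k, covariance_type=ct, random_state=0).fit(X).bic(X),
     ct, k)
    for ct in ["full", "tied", "diag", "spherical"]
    for k in range(5, 10)
)
print(best[1], best[2])
```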

the best option =)

Comparing the data with clustering labels against the original labels, we can see a somewhat similar pattern of colors. Compared with K-Means (in the part below), the Gaussian Mixture performs better. Note that in reality, we may not have the original labels to compare against.

Bayesian GaussianMixture

K-Means

K-Means fails to capture the pattern of the original labels, even though it appears to group the data well.

Clustering selection

Recall our goals at the beginning of the unsupervised learning part: we try to find a clustering method that gives balanced labels close to 7 clusters. These criteria can be judged via the total number of labels and the number of observations in each cluster. Besides, the homogeneity level within a cluster is also considered important; we measure it by looking at the unique rows within a label.
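These criteria (balance across clusters, homogeneity within clusters) can be sketched as follows; note that the homogeneity score against the true labels is only computable here because the toy data has them:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score

X, y_true = make_blobs(n_samples=700, centers=7, random_state=0)
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

sizes = np.bincount(labels)             # how many observations per cluster?
balance = sizes.min() / sizes.max()     # 1.0 = perfectly balanced clusters
hom = homogeneity_score(y_true, labels) # 1.0 = each cluster holds a single class
print(sizes, round(balance, 2), round(hom, 2))
```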


Based on the above metrics and the results shown above, we can conclude that GAUSSIAN MIXTURE ON SCALED FEATURES is the best method.


The plots in each part comparing the data with clustering labels vs. the data with original labels also lead to the same conclusion. However, in unsupervised learning, we do not know the original labels.

Classification model

We run classification based on the labels obtained from the best clustering method (Gaussian Mixture on scaled features). The model works quite well with the labels from clustering. After testing many models using 5-fold cross-validation, SVC and decision-tree-based models (except AdaBoost) seem to be suitable choices.

Summary and conclusion

Supervised learning - regression

To understand the overall suitability of the algorithms for our data, we used the default settings of the Linear regression, RANSAC, Gradient descent, Support vector machine ("SVM"), Decision tree and Random forest models, and selected SVM, Decision tree and Random forest for further parameter tuning as these achieved the best scores. The best performance was achieved with the Random forest model, which significantly outperformed the other models and reached 86% accuracy without overfitting on the training set.

Supervised learning - classification

We used the KNeighbors Classifier, SVC, Logistic Regression, Decision Tree, Random Forest, AdaBoost Classifier and Gradient Boosting, and compared their accuracy scores without tuning for the model selection. With accuracy scores higher than 0.7, we chose the KNeighbors Classifier, Decision Tree, Random Forest and Gradient Boosting to continue. After feature scaling and parameter tuning, Random Forest gives the best result, with an accuracy score of 84% and a balanced f1_score across labels. Therefore, Random Forest is selected as the prediction model for the obesity level classification task.

Semi-supervised learning

We chose Logistic Regression as the model to work with. With only 100 labeled data points, the accuracy score is 52%. After finding a representative sample using KMeans and giving its labels to other instances, the score decreases to 35%. To improve the score, we propagate the labels to the 50% of points closest to the centroid, which raises the accuracy to 57%. However, both the KMeans step and the model selection still need further improvement, as the score remains lower than expected.

Unsupervised learning

We try to cluster the data into around 7 balanced groups. We visualize our data using TSNE based on the original data, data selected via feature engineering, scaled and dimension-reduced data, and data with outliers removed. First, we see that feature engineering and outlier removal do not have a significant impact in this case; secondly, we identify relatively suitable clustering methods: Gaussian Mixture, Bayesian Gaussian Mixture and K-Means. After tuning parameters, Gaussian Mixture with scaled input data is the best option. The best clustering output works well with SVC, Decision Tree, Random Forest and Gradient Boosting.